{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/multilingual/cohere-multilingual/cohere-multilingual-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/multilingual/cohere-multilingual/cohere-multilingual-search.ipynb)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Dlic9TAaRZOK",
        "outputId": "f358f449-c4e3-46cb-e6d6-4662d4b30381"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
            "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
          ]
        }
      ],
      "source": [
        "!pip install -qU datasets cohere 'pinecone-client[grpc]'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q4Yo7jvcRfm7"
      },
      "source": [
        "# Multilingual Search with Cohere\n",
        "\n",
        "Cohere released what might be the most advanced multilingual embedding model back in December 2022.\n",
        "\n",
        "Cohere's multilingual model supports **100+** languages and at the time of release provided *230%* better performance than the previous state-of-the-art in multilingual search.\n",
        "\n",
        "A key advance in the ability of this model (beyond pure performance), is the ability to create *meaningful* embeddings for longer text. Previous multilingual models would not produce quality embeddings for anything longer than a sentence of text, Cohere's `multilingual-22-12` model can do it for paragraphs of text.\n",
        "\n",
        "## Dataset\n",
        "\n",
        "We'll start by setting up our dataset for multilingual search. The dataset being used is the [Wikipedia multilingual dataset](https://huggingface.co/datasets/wikipedia).\n",
        "\n",
        "To download the dataset we do:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "ke0kEDaxRiKq"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "en = load_dataset(\"Cohere/wikipedia-22-12\", \"en\", streaming=True)\n",
        "it = load_dataset(\"Cohere/wikipedia-22-12\", \"it\", streaming=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RacgHwwX7T3D",
        "outputId": "543ba6ae-eb17-47d5-cac7-f380e6435003"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'id': 0,\n",
              " 'title': 'Deaths in 2022',\n",
              " 'text': 'The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence:',\n",
              " 'url': 'https://en.wikipedia.org/wiki?curid=69407798',\n",
              " 'wiki_id': 69407798,\n",
              " 'views': 5674.4492597435465,\n",
              " 'paragraph_id': 0,\n",
              " 'langs': 38}"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "next(iter(en['train']))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QzxnKOxU7aHm",
        "outputId": "dbd9081f-b512-4f5f-96cf-9d1db910f1b3"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'id': 0,\n",
              " 'title': 'Italia',\n",
              " 'text': \"LItalia (, ), ufficialmente Repubblica Italiana, è uno Stato membro dell'Unione europea, situato nell'Europa meridionale, il cui territorio coincide in gran parte con l'omonima regione geografica. L'Italia è una repubblica parlamentare unitaria e conta una popolazione di circa 59 milioni di abitanti, che ne fanno il terzo Stato dell'Unione europea per numero di abitanti. La capitale è Roma.\",\n",
              " 'url': 'https://it.wikipedia.org/wiki?curid=2340360',\n",
              " 'wiki_id': 2340360,\n",
              " 'views': 3425.779427882056,\n",
              " 'paragraph_id': 0,\n",
              " 'langs': 307}"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "next(iter(it['train']))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ih21iaHDkTSb"
      },
      "source": [
        "We have 6.46M English records, and 1.74M Italian records.\n",
        "\n",
        "If you like, feel free to use the full dataset — naturally this will cost money.\n",
        "\n",
        "For the sake of time and your pocket, in this demo we'll stick with a smaller set of ~100K records from each language. You can modify this number later as we get to the **Indexing** step."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h2nBoRP0hsET"
      },
      "source": [
        "## Encoding with Cohere\n",
        "\n",
        "To embed our text using Cohere we need to first initialize our connection to Cohere. For this we need an [API key](https://dashboard.cohere.ai/api-keys), then we do:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "k6f7uyQUS8SS"
      },
      "outputs": [],
      "source": [
        "import cohere\n",
        "\n",
        "co = cohere.Client(\"COHERE_API_KEY\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nrEkQKstih01"
      },
      "source": [
        "Given some text we embed it using the `multilingual-22-12` model like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hB-TgoqripLD",
        "outputId": "e10d8c87-58e0-4fb0-c4d7-5dd4fbabda2a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "768, 2\n"
          ]
        }
      ],
      "source": [
        "texts = [\"hi, how are you!\", \"ciao come va?\"]\n",
        "\n",
        "# create embeddings\n",
        "res = co.embed(texts=texts, model='multilingual-22-12')\n",
        "# pull embeddings from response\n",
        "embeds = res.embeddings\n",
        "print(f\"{len(embeds[0])}, {len(embeds)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I1w8XOt6lfqv"
      },
      "source": [
        "This shows that we have `2`  `768`-dimensional vectors, one for each *text* that we just encoded.\n",
        "\n",
        "That's it, we've created our embeddings — it's incredibly easy to do.\n",
        "\n",
        "Before we embed and index everything we'll need to initialize a vector index using Pinecone to store our embeddings within."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YvZWNJ59mT-Q"
      },
      "source": [
        "## Creating Vector Index\n",
        "\n",
        "To create a vector index we need to initialize our connection to Pinecone, for this we need a [free API key](https://app.pinecone.io/) and then pass it below:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "9Lnij24pjCZm"
      },
      "outputs": [],
      "source": [
        "from pinecone import Pinecone\n",
        "\n",
        "pinecone.init(\n",
        "    api_key=\"PINECONE_API_KEY\",  # app.pinecone.io\n",
        "    environment=\"YOUR_ENV\"  # next to api key in console\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aSkm1FIhnCfn"
      },
      "source": [
        "Now we can initialize the vector index. There are a few parameters we need to do this:\n",
        "\n",
        "* `dimension`: vector dimensionality, this must align to the embedding model dimensionality — for us this is `768`.\n",
        "\n",
        "* `metric`: the similarity metric being used to compare vectors. Different embedding models produce vectors that should be used with different metrics — in this case we need `'dotproduct'`\n",
        "\n",
        "* `pod_type`: choose `p1` for speed, `s1` for storage, or `p2` for *even more speed*.\n",
        "\n",
        "* `pods`: number of pods needed — for `p1` we can fit ~1M vectors on `1` pod, for `s1` we can fit ~5M vectors on `1` pod."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5lD9CIEcnB4q",
        "outputId": "14b76180-c621-4d02-a86a-f8720838f456"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'dimension': 768,\n",
              " 'index_fullness': 0.0,\n",
              " 'namespaces': {},\n",
              " 'total_vector_count': 0}"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "index_name = 'cohere-multi-wiki'\n",
        "\n",
        "# check if index already exists (it wont if this is first time running)\n",
        "if index_name not in pinecone.list_indexes().names():\n",
        "    # now create the new index\n",
        "    pinecone.create_index(\n",
        "        index_name,\n",
        "        dimension=len(embeds[0]),  # 768\n",
        "        metric='dotproduct',\n",
        "        pod_type='s1',\n",
        "        pods=1\n",
        "    )\n",
        "\n",
        "# connect to index\n",
        "index = pinecone.Index(index_name)\n",
        "# then check index status\n",
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g1xEQnLS7nR0"
      },
      "source": [
        "With our embedding model and vector index setup we can move onto indexing *everything*."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mEtnBBH87tSr"
      },
      "source": [
        "## Indexing Everything"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "T0Zdxqzfpnwk"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "76fefe319c30466eb7fa9626707b4915",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/34 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "\n",
        "batch_size = 300\n",
        "lang_limit = 10_000  # number of records to index from each language\n",
        "\n",
        "data = {'en': iter(en['train']), 'it': iter(it['train'])}\n",
        "\n",
        "for i in tqdm(range(0, lang_limit, batch_size)):\n",
        "    # so that we don't go over the language limit set above\n",
        "    i_end = min(i+batch_size, lang_limit)\n",
        "    # we do for each language\n",
        "    for lang in ['en', 'it']:\n",
        "        # get the relevant batch\n",
        "        batch = [next(data[lang]) for _ in range(batch_size)]\n",
        "        # extract text\n",
        "        texts = [x['text'] for x in batch]\n",
        "        # create embeddings\n",
        "        embeds = co.embed(texts=texts, model='multilingual-22-12').embeddings\n",
        "        # create ids\n",
        "        ids = [f\"{lang}-{x['id']}\" for x in batch]\n",
        "        # we might also want to create metadata containing:\n",
        "        # the original text, title, url, and language\n",
        "        metadata = [{\n",
        "            'text': x['text'], 'title': x['title'], 'url': x['url'], 'lang': lang\n",
        "        } for x in batch]\n",
        "        # now we can index the batch\n",
        "        index.upsert(zip(ids, embeds, metadata))"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We check for the total number of vectors added to the index:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'dimension': 768,\n",
              " 'index_fullness': 0.0,\n",
              " 'namespaces': {'': {'vector_count': 20400}},\n",
              " 'total_vector_count': 20400}"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "index.describe_index_stats()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we move on to querying."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Making Queries\n",
        "\n",
        "We first define a `search` function to handle embedding, querying, and printing results."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {},
      "outputs": [],
      "source": [
        "import urllib\n",
        "\n",
        "def search(query: str):\n",
        "    # create query vector\n",
        "    xq = co.embed(texts=[query], model='multilingual-22-12').embeddings[0]\n",
        "    # search index\n",
        "    res = index.query(vector=xq, top_k=3, include_metadata=True)\n",
        "    # print results\n",
        "    for i, record in enumerate(res['matches']):\n",
        "        metadata = record['metadata']\n",
        "        # print key info\n",
        "        print(f\"{i+1}. {metadata['title']} ({metadata['lang']})\")\n",
        "        print(f\"  {metadata['url']}\")\n",
        "        print(f\"  {metadata['text'][:100]}...\")\n",
        "        # get translate link if not english already\n",
        "        if metadata['lang'] != 'en':\n",
        "            translate_url = \"https://translate.google.com/?sl=auto&tl=en&text=\"+urllib.parse.quote_plus(\n",
        "                metadata['title']+\"\\n\"+metadata['text']\n",
        "            )\n",
        "            print(f\"  Translate: {translate_url}\")\n",
        "        print()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's try something not well covered by English wikipedia pages:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1. Mostro di Firenze (it)\n",
            "  https://it.wikipedia.org/wiki?curid=658864\n",
            "  Uno dei testimoni principali dell'accusa contro Pacciani fu Giuseppe Bevilacqua, un funzionario dell...\n",
            "  Translate: https://translate.google.com/?sl=auto&tl=en&text=Mostro+di+Firenze%0AUno+dei+testimoni+principali+dell%27accusa+contro+Pacciani+fu+Giuseppe+Bevilacqua%2C+un+funzionario+dell%27American+Battle+Monuments+Commission+che+nel+1985+dirigeva+il+cimitero+americano+di+Firenze+in+localit%C3%A0+Falciani%2C+a+poche+centinaia+di+metri+dall%27ultima+scena+del+crimine+del+Mostro+in+Via+degli+Scopeti.\n",
            "\n",
            "2. Gianluigi Buffon (it)\n",
            "  https://it.wikipedia.org/wiki?curid=103015\n",
            "  Nell'estate 2009 viene ingaggiato come testimonial dalla \"poker room online\" PokerStars. Nell'ottobr...\n",
            "  Translate: https://translate.google.com/?sl=auto&tl=en&text=Gianluigi+Buffon%0ANell%27estate+2009+viene+ingaggiato+come+testimonial+dalla+%22poker+room+online%22+PokerStars.+Nell%27ottobre+2011%2C+si+ritrova+assieme+ad+Eleonora+Abbagnato%2C+in+uno+spot+pubblicitario+per+Ferrarelle.+Diversi+anni+dopo%2C+nel+2016%2C+subentra+a+Rocco+Siffredi+come+%22testimonial+pubblicitario%22+di+Amica+Chips%3B+mentre+nel+2018+tocca+alla+Birra+Moretti%2C+che+lancia+con+la+sua+immagine%2C+una+campagna+multimediale+chiamata%2C+%22Fai+ridere+Gigi+Buffon%22.\n",
            "\n",
            "3. Silvio Berlusconi (it)\n",
            "  https://it.wikipedia.org/wiki?curid=2749106\n",
            "  Dal 2020 Berlusconi è fidanzato con Marta Fascina (Melito di Porto Salvo, 9 gennaio 1990), deputata ...\n",
            "  Translate: https://translate.google.com/?sl=auto&tl=en&text=Silvio+Berlusconi%0ADal+2020+Berlusconi+%C3%A8+fidanzato+con+Marta+Fascina+%28Melito+di+Porto+Salvo%2C+9+gennaio+1990%29%2C+deputata+di+Forza+Italia+eletta+nel+2018+nella+circoscrizione+Campania+1.\n",
            "\n"
          ]
        }
      ],
      "source": [
        "search(\"who is giovanni falcone?\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Once you're done, delete the index to save resources."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pinecone.delete_index(index_name)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "---"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "ml",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.12"
    },
    "vscode": {
      "interpreter": {
        "hash": "b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
